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Abstract 


This paper demonstrates how a multilevel substructuring technique, called 
the Hierarchical Poly Tree (HPT), can bemused to integrate a localized mesh 
refinement into the original finite element, model more efficiently. The op- 
timal HPT configurations for solving isoparametrically square h-, p-, and 
h p-extensions on single and multiprocessor computers is derived. In addi- 
tion, the reduced number of stiffness matrix elements that must be stored 
when employing this type of solution strategy is quantified. Moreover, the 
HPT inherently provides localized “error-trapping” and a logical, efficient 
means with which to isolate physically anomalous and analytically singular 
behavior. 
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1 INTRODUCTION 


With the advent of affordable computer resources, engineers have come 
to rely upon numerical techniques to simulate various types of physical 
phenomena. These include structural mechanics, fluid dynamics, electro- 
magnetic fields, and heat transfer to name a few. The most widely used 
methods or numerically approximating this behavior are generalized Finite 
Element (FE) and Finite Difference (FD) formulations. 

Due to the limited approximation/interpolation capabilities of the afore- 
mentioned numerical techniques, the results of an analysis depend heavily 
upon the proper discretization of the system in question. Oftentimes, as 
the solution process proceeds, various localized phenomena may occur that 
would require a refinement of the model to ensure the reliability of the 
solution. Such model refinements are triggered by the occurrence of; 


1. Shock wave formation, 

2. Cracking, 

3. Material nonlinearity, 

4. Geometric nonlinearity, 

5. Boundary layer formation, and 

6. Varying boundary conditions, etc. 


To date, many mesh refinement schemes have been developed that ad- 
dress this issue. These schemes are typically classified as r-, or p- 
extensions; i.e., 


r-extension: Is a node relocation scheme which adapts the 

spatial coordinates of the nodes toward the 
optimal location. The number of nodes and 
elements are fixed when using this approach; 
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/{.-extension: Is a scheme wherein elements containing a large 

amount of error are refined into much smaller elements; 
p-ext ension: Is a method that employs higher order polynomials 

for the shape functions of elements containing 
a large amount of finite element approximation error. 

These schemes have been successfully used, both individually and con- 
comitantly, to solve PDEs of the elliptical, parabolic, and hyperbolic types 
in one and two dimensions [1-3]. Due to the large expanse of literature avail- 
able that pertains to this subject it would be impractical, if not impossible, 
to reference all of the authors that have made contributions related to the 
development of these techniques. In light of this, we will simply refer the 
reader to the cumulative works of I. Babuska and B.A. Szabo as well as the 
publications compiled in [4]. 

Regardless of whether an r-, h -, and/or p-extension is employed, several 
iterations of the solution process are required before satisfactory results can 
be attained. It directly follows that the implementation of these techniques 
is somewhat restrictive, especially for large FE systems, because of the 
large computational costs involved. In this context, we will illustrate how 
the Hierarchical Poly Tree (HPT)[5,6] solution strategy can be incorpo- 
rated into the aforementioned mesh refinement routines so as to yield more 
efficient computational algorithms for both sequential and multiprocessor 
type machines. In addition, an HPT inherently provides; 

1. The minimization of in-core and out-of-core memory requirements; 

2. A logical/efficient means of “isolating” localized mesh adaptations; 

3. Localized “error- trapping”, i.e., the influence of localized modelling 
errors are essentially confined to their “branch” of the Tree; 

4. An orderly multilevel organization of the model topology wherein 
interpolative reduction schemes can be employed between levels for 
simulations involving a hierarchy of fine to very coarse scales in its 
definition[5,6 ]. 
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In this context , the forthcoming chapters will: 

1. Provide the motivation/philosophy of utilizing the HPT for solving 
dynamically refined FE discretizations, 

2. Develop the optimal multilevel HPT for the solution of locally refined 
regions in a sequential type computing environment, 

3. Illustrate the potential advantages of using the HPT solution strategy 
in a multiprocessor environment. 


2 HPT PHILOSOPHY 

The Hierarchical Poly Tree (HPT) is a multilevel substructuring technique 
that has been shown to yield significant speed enhancements with reqards 
to the solution of the resulting system of simultaneous algebraic equations 
that arise from FE formulations. Moreover, when implemented in a multi- 
processor environment, the HPT has the potential of solving the numerical 
problem with superlinear speed enhancements, i.e. the resultant speedup is 
greater than the number of processors used in parallel. Given that the use 
of the aforementioned dynamic mesh refinement schemes typically necessi- 
tates the solution of the system equations several times before an adequate 
solution is attained, the incorporation of the HPT solution strategy into 
their respective algorithms would prove to be very advantageous. This is 
especially true when addressing large FE systems. 

The development of the HPT solution strategy evolved from the recog- 
nition that the computational effort associated with the solution of 

(A-]{n=m (2.i) 

can be approximated as [7], 

C e = ^[W^B + 1)] + 2 N e N b (2.2) 

where 


C g — total number of arithmetic operations necessary 
to solve Equation (2.1) 


3 



N e 

N b 


2 N e [N b (N b + 1)] 


2 N e N b 


number of equations/unknowns 
mean-half-bandwidth of [A ] 

number of arithmetic operations required 

to perform an \L\ [A] [L] T factorization of [A*] 

via a direct, skylined solver such as the one 
employed in ADINA [7,8] 

number of arithmetic operations needed to 
factorize {A} and perform the subsequent 
back substitution step for calculating {y}. 


Furthermore, since [A ] is symmetric, the mean-half-bandwidth is defined 
to be 


N b = - 


a 


(2.3) 


where 


a = N e 

(3 — number of stiffness matrix elements stored in the 

upper triangular of [A*] via a skylined method. 

Substituting (2.3) into (2.2) yields 



Thus, equation (2.4) is an approximation of the computational effort asso- 
ciated with the solution of (2.1) in terms of the number of equations and 
stiffness matrix elements. 

Now, consider the process of hierarchical substructuring as shown in 
Figure 2.1. Note that the 1~ level, or “trunk” of the Hierarchical Poly 
Tree, represents the final assembled composite version of the FE model. 
The V— level, or outermost “branches” of the Tree, represents the sub- 
structures comprised of fundamental finite elements, e.g. 4-node quadrilat- 
eral elements. The intermediate levels represent the various assemblages 


4 


of the hierarchy that, are used to traverse from one extreme to the other. 
Referring to Figure 2.2, the algorithmic steps associated with the grafting 
of “branch” substructures to their “root” substructure is as follows; 

1. Forward Phase 

(a) Assembly of the condensed stiffness matrices and force vectors 
from the preceeding level, 

(b) Partitioning of the local stiffness matrix into its internal and 
external components. Note that this can be accomplished by 
simply employing the proper node numbering locally, 

(c) Forward elimination/condensation of the local stiffness matrix, 

(d) Condensation of the local force vector, 

(e) Transfer of the condensed stiffness matrix and force vector to 
succeeding levels, i.e. from each branch to its root processor; 

2. Backward Phase 

(a) Backward transfer of the results calculated at. the root substruc- 
ture to its branch substructures, 

(b) Adjustment of the local internal “load” vector, 

(c.) Condensation of the locally adjusted internal “load” vector, 

(d) Back substitution, i.e. calculation of the independent, variables 
within the external periphery of the branch substructures. 

In terms of Figures 2.1 and 2.2, it follows that, each root-branch system 
must undergo the steps noted above in the Forward and Backward phases 
of the solution process. For multilevel trees, each branch processor is itself 
a root for subsequent sets of branches. 

From a sequential point of view, the HPT approach to the problem is 
to optimize 


where the superscript 5 and subscript l denote the s— substructure on the 
/— level, and 


Cts — total computational effort arising from solving 

every level and substructure in a sequential manner, 
*Nej — number of internal equations, 

\Nbj — mean-half-bandwidth of the [f A'//] partition, 

L — total number of levels, 

S(l) — number of substructures on the level. 


approximate number of arithmetic operations 
required to condense [f A ] 

approximate number of arithmetic operations 
needed to condense {* F} 
number of arithmetic operations needed to 
factorize {f F?} and perform the subsequent 
back substitution step for calculating {?!/}, 

When operating in a multiprocessor environment the various substruc- 
tures on a given level can be solved/condensed concurrently. As a result, 
the HPT is configured to minimize the following 

+ 1)] + C,N E )C l N B ) + 2 (iNe,) 

L Z ) rnaar 

( 2 . 6 ) 

where the subscript max denotes the substructure on the /— level requiring 
the maximum computational effort. By employing the definition of the 
mean-half-bandwidth, Equations (2.5) and (2.6) can be recast as 


5 ;»B I'JVfl G'JVfl + 1)] - 

(f Ne)C,n b ) - 

2 (;aw(;jv b ,) - 
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+ 2 Wi 


Cts = 


and 

C'tp = 


The optimal multilevel decomposition of a finite element discretization 
into a hierarchy of substructures can be derived by employing various func- 
tions for f3 and a[5,6]. A number of these functions for two-dimensional 
substructuring “primitives” have been generated and are compiled in the 
Appendix. Note that these functions are isoparametric in nature. For 
example, the FE models shown in Figures 2.3 and 2.4, discounting the ap- 
propiiate boundary conditions and assuming they share the same number 
of DOF per node, yield the same (3 and a. The “mechanics” of obtaining 
an optimal HPT will be shown more rigorously in the next chapter. 

In terms of the r-extension, the initial HPT description of the model 
would be “fixed”. This is because the r-extension method does not alter 
the fundamental “connectivities” of the various finite elements. In other 
words, even though the values stored in [it'] will change, (3 and a remain 
constant. On the other hand, the HPT must be able to adapt “on-the- 
fly” when employing the h-, p-, and/or /ip-extensions. In this context, the 
following chapter will illustrate the flexibility of the HPT to self adapt to 
the demands of these type of refinements. 


L S(l) 

EZ 

l=i 1=1 
L S(l) 

EE 

1=1 S=1 


\lP + 3 


a 


[I W 

2 fa 


(2.7) 


L 

£ 

/=i 




+ 2 U3A 


y [1 ilH 2 ' 

i = i / a 


( 2 . 8 ) 


3 SEQUENTIAL HPT FORMULATION 

Ideally, the initial global FE discretization would be decomposed into a 
hierarchy of substructures that have been configured according to the HPT 
optimality criteria. Thereafter, as mesh refinement is initiated, the HPT 
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would sprout multilevel root-branch systems that are optimized in the local 
sense. This is illustrated in Figure 3.1. However, regardless of whether a 
global HPT has been constructed or not, the onset of localized mesh re- 
finement is best handled by an HPT approach. Therefore, the forthcoming 
development will illustrate how localized multilevel HPT’s should be con- 
structed for the sequential solution' of regions that have been dynamically 
refined by Avay of h-, p-, and /jp-extensions. In addition, to better con- 
vey the advantages of using this type of solution strategy, we will restrict 
the discussion to isoparametrically square regions comprised of four node 
quadrilateral elements. 

Upon completion of the aforementioned, we will conclude the chapter 
with a comparison of the computational efforts associated with the various 
mesh refinement techniques. As will be seen, the differences in the number 
of arithmetic operations required to statically condense out the internal 
variables of the different methods of enhanced mesh discretization can be 
substantial. 


3.1 SEQUENTIAL HPT FOR fi-EXTENSIONS 


When performing a model refinement by the /i-extension method, the 
discretization within an element/region is increased by using smaller ele- 
ments (see Figure 3.2). Typically, the computational steps associated with 
integrating the mesh refinement into the global formulation is as follows; 

1. Static condensation is employed to remove the internal DOF, 

2. The proper interpolation constraints are implemented so as to main- 
tain element to element compatibility yielding the final representation 
of the refined zone in terms of the original DOF about the periphery, 

3. The new element /region stiffness is assembled into the global formu- 
lation. 

This sequence of steps is depicted in Figure 3.3. Now, assuming the num- 
bering scheme shown in Figure A.l, /3i, and a/, have the following form. 
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(3.1) 


— 2 [®^( 771 ) 3 — (Up ~ l)(”i) 2 + G/9J?jj 

- HpYM 3 

a h = p{ni) 2 (3.2) 

where 

p — node density, i.e. DOF per node, 
n i - nodal dimension of the problem, i.e. the number 
of nodes along an edge. 

Note that in obtaining (3.1) and (3.2) from the substructure primitive in 
Figure A.l, we have set the nodal dimensions m and n equal to each other. 
Thus, the computational effort associated with statically condensing out 
the internal variables is 

(3.3) 

To solve this problem with a two level, sequential HPT, the solution 
process is as follows; 

1. The region of refinement is decomposed into an arbitrary number of 
equivalent substructures as depicted in Figure 3.4, 

2. The individual substructures are condensed into their external DOF, 

3. The condensed substructures are subsequently assembled to yield the 
composite structure shown in Figure 3.5, 

4. The composite assembly is then statically condensed into the external 
DOF defining the periphery of the refined element /region, 

5. The proper interpolation constraints are implemented so as to main- 
tain element to element compatibility, yielding the final representa- 
tion of the refined zone in terms of the original DOF. 

6. The new element/region stiffness is assembled inlo the global formu- 
lation (see Figure 3.6). 
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Notice that the only difference between the HPT approach and the standard 
technique is in the way the internal variables are condensed out of the 
problem. 

Recalling Equation (2.7), the computational effort, of condensing out 
the internal variables of the refined element/region by a two level HPT in 
a sequential manner can be written as 


/> 


C 


i m 2 


(Kif 


+ E 

3 = 1 


1 Ml 

2 Jq 


(3.4) 


where (A' 2 ) 2 is the total number of 2— level substructures. Furthermore, 
the functionalities of iO and j/? for K 2 > 3 are, from Figure A. 2, 

ja = {2 [(A’ 2 ) 2 + A' 2 ] n 2 - [3(A 2 ) 2 + 2A' 2 - 1]} p (3.5) 

1 /? = ^{(14(A' 2 ) 3 + 7(A’ 2 ) 2 -5A* 2 )p(n 2 ) 2 

- {(37(A’ 2 ) 3 + 13(A' 2 ) 2 — 18A' 2 ]p - [2 (A' 2 ) z + 2A' 2 ]}n 2 
+ {[24(A' 2 ) 3 + 5(A' 2 ) 2 - 14A' 2 + 1 }p - (3(A' 2 ) 2 + 2A' 2 - 1]}} 

(3.6) 


The computational effort associated with the second level substructures is 
approximated by using the functions obtained from Figure A.l, i.e. 


5 a 


where 


pM 2 

f[6p (n 2 ) 3 - (lip - l)(n 2 ) 2 + 6p n 2 ] J’ 


(3.7) 


n 2 = 


ni + K : 


A' 2 


(3.8) 


Due to the construct of the decomposition, all of the second level sub- 
structures will exhibit the same computational effort. Equation (3.4) can 
then be rewritten as 


hC'rs ~ 


1 (i n , (a 2 ) 2 ( 2 / d ) 2 

2 iQ 2 2 o 


(3.9) 
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Note that, for convenience, we have discarded the use of the superscript s. 
Assuming ??i A 2 , the problem dimension of the second level substruc- 

tures is, from (3.8), 

n i ~ —■ (3.10) 

A 2 

Incorporating this assumption and (3.10) into (3.5) - (3.7) yields 

iQ ~ 2 (A' 2 + l)pni (3.11) 


1/? 



(2 A 2 -I- 1) (n x ) 2 


(3.12) 


P{n 1 ) 
(A'a) 


3p*(n,)* 

(a- 2 ) 3 

Employing (3.11) - (3.14) in (3.9) and requiring that 


(3.13) 

(3.14) 


djhCrs) 

d(A 2 ) 


(3.15) 


yields 


Ii- 


r36 i 1 / 3 

49 Wl 


; K 2 > 3 


(3.16) 


Thus, the proper number of second level substructures needed to minimize 
hC'rs is given by (3.16). The computational speedup afforded by this ap- 
proach is illustrated by forming the ratio 


Rh/TS 


Ch 

hCrs 


(3.17) 


Recall that hC'rs and Ch are the approximate number of arithmetic op- 
erations required to condense out the internal variables of the refined el- 
ement/region with and without substructuring respectively. Utilizing the 
result of (3.16), it can be shown that 


Rh/TS 


Ch 

hCrs 


0.31 (77,) 2/3 ; K 7 > 3 


(3.18) 
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Focussing on the special case of K 2 
(from Figure A.3), 


= 2, the functions for i a and 1 3 are 


and 


IQ = (6 77 1 - 9) P 

~ 6 p nj 


(3.19) 


10 = ^(6V(ni) 2 - (182 p - 12) nj + (123p - 18)] 



(3.20) 


n 2 


77 1 + 1 

2 


Hi 

2 


(3.21) 


It is easily verified that the speedup attainable from this decomposition of 
the problem is 


R, 


h/TS = 


Ch 


864 t?i 


hCrs (4225 + 216 nt) 


Ii 2 = 2 


(3.22) 


Moreover, 

lim Rh/rs — 4 ; Ii 2 = 2 (3.23) 

Figure 3.7 graphically illustrates the potential speedups that can be ob- 
tained for a sequentially solved two level HPT with K 2 = 2,3, and 4 for 
/?-ex tended mesh refinements where 7ij < 60. 

The preceeding development has been based upon the assumption that 
the problem size, 7 i 1} is large. At this juncture it is appropriate to ask the 
question, ’How large must Tij be before the benefits of the HPT are real- 
ized?’ To address this question, a number of numerical experiments were 
performed. The speedups obtained from this empirical study are depicted 
in Figures 3.8 - 3.11 and tabulated in Tables (3.1) - (3.3). It can be clearly 
seen that as the problem size increases, the speedups are as predicted by 
the foregoing development. In addition, the following observations can be 
made: 
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1. Small problems are dominated by the computational overhead. How- 
ever, for larger problems, this effect becomes negligible; 

2. Decomposing the problem into four substructures (/v 2 = 2) is more 

advantageous than nine (A* 2 = 3) for small n, because less overhead 
is accrued; 

3. As the DOF per node increases, i.e. p, the results are improved for 
small values of n,. This occurs because, for a given problem size 
Hj, the overhead becomes less influential as a result of the actual 
computational effort increasing by an order of C)(p 3 ). 

4. For larger problems, the second level substructures themselves be- 
come large enough to warrant another level of substructuring. 

It was also found, as evidenced by Tables (3.1) - (3.3), that, the system of 
equations resulting from this type of mesh refinement, can be condensed/ 
solved more efficiently by utilizing a multilevel HPT with K, = 2, / e [2, L) 
where the number of levels that should be employed is governed by 

p{"L- 1) 2 > 350 (3.24) 

Equation (3.24) arises from the fact that, whenever p ( n L ) 2 is greater than 
350, another level of substructuring is justified so long as Ii L+1 = 2. Thus, 
to determine the approximate number of levels that should be used in terms 
of p and nj, nx_! can be cast as 


nL- i 


n i + A/ — 1 

n L~2k, 


»i 

nfc 1 a/ 


Requiring that K t = 2, / 6 [2 ,£], 

n L - 1 ~ 

Substituting (3.26) into (3.24) yields 


n, 


(2)(i-2) 


A < - {ln[p (»?] ) 2 ] - 3} 


(3.25) 


(3.26) 


(3.27) 
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Before proceeding, it must be emphasized that the limiting number of 
levels is based upon empirical data. That is, the computational overhead 
incurred varies from one machine to the next and is the dominant factor in 
determining the smallest problem size, p(n 3 ) 2 , that can benefit from this 
type of solution strategy. 

To estimate the performance of an L level HPT solved sequentially, 
wherein K\ = 2, l e [2 ,L], we can write 


hCts = 5 


hil + 4 + 1C <*£ + 




2<* 




+ (2) 2(i 


-i) ijM 


where, for l € [1 ,L — 1], 

) a = ( 671/ — 9 ) p 

~ 6 p ni 


,/3 = ^[65 pint) 7 - (182 p - 12) n, + (123^? - 18)] 
65 


jp 2 (n,) 2 


(3.28) 


(3.29) 


(3.30) 


and 


71 / 




; l £ [2 ,L] 


(2)('- 1 > 

L a = p{n L f 

L0 = ^pMf - {Up - l)(n L ) 2 + Gpn L ) 
~ 3 p 2 (n L ) 3 

Using (3.29) - (3.33), we can recast (3.28) as 


hC'ts = 


4225 

192 

4225 

192 


p 3 M 3 

P 3 M 3 


L-i t 

.£ (2)"- 1 ' 
1 


+ ~p 


(2Y l ~ 2) 


+ oP 


(3.31) 

(3.32) 

(3.33) 


(n t ) 4 

2 H (2) 2 ^- j ) 

Ml 

2 ,J (2 )«*-*> 


(3.34) 
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As before, the potential speedup is estimated by forming the ratio 

Ch 


R 


h/TS = 


hC'rs 


432 (2) 2 < L - 1 >n 1 


{4225 (2) (i-1 ) [(2 )( £ -— 1 ) - 1] + 432 n,} 


(3.35) 


Equations (3.2/) and (3.35) can be used in conjunction to determine the 
appiopiiate number of levels and the concomitant speedup that can be 
expected for a given ^-extended refinement defined by p and nj. It is also 
interesting to note that, for a given fixed value of L , the speedup for large 
values of n l is 

.hni o ^ l/T s = (2) 2 < L - 1 ) ; K, = 2, l e [2 ,£] (3.36) 

Another important feature inherent to the HPT solution strategy is 
the reduction of the memory needed to store the stiffness matrix element s. 
However, even though the total storage requirements are reduced, the sub- 
structures on the 2— through L— levels must store the partition of the 
stiffness matrices containing the connectivities between the internal and 
external D.O.F., i.e. [A/e], separately before it is altered by the conden- 
sation process. This is because the internal load vector, {/*>}, of a branch 
substructure must be adjusted by way of 

{Fl}adj. = {Fj} ~ [A'/£ ]{}£;} (3.37) 

to reflect the influence of the external displacements, {Te}, calculated by 
the root. 

In this context, the number of stiffness matrix elements that must be 
stored for an l— level substructure can be written as 

Mi = j/d (3.38) 

Mj = if3 +/ fiiE ;/(E[2.Z] (3.39) 
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where 


Mi — total number of stiffness matrix elements that 
must be stored 

10 — number of stiffness matrix elements stored in 

[A'] via a skylined technique 

i0ie ~ number of stiffness matrix elements in the [A'/e] 
partition of [A”] 

When using the optimal IIPT configuration of IC, = 2, / e [2 ,£], the 
functions for if3 are defined by (3.30) and (3.33) while 

i0ie = ^-(26(n/) 2 — 84n/ + 50] 

~ yp 2 («t ) 2 ; l € [i,(A — l)] (3 - 40) 

L0IE = p 2 [ 2(n£,) 3 - 8(nx) 2 + lOn-i — 4] 

~ 2 p\n L f (3.41) 

Using (3.31), (3.40), and (3.41); (3.38) and (3.39) can be written as 


Mi 

~ ^V(»i ) 2 

( 3 . 42 ) 

M t 

~ 6 [2,(£-l)J 

( 3 . 43 ) 

M l 

. , <».)* 
bp (2) 3 < Z '- 1 * 

( 3 . 44 ) 


Then, for an L level HPT, the total storage needed lor all the various 
stiffness matrices can be cast as 

M ts = Mi + (4 )M 2 + ( 16 ) M 3 + + (2 ) 2a ~ 1] M L ( 3 . 45 ) 
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From (3.42) - (3.44), 


65 


„Mrs = f + ~p\L - 2)(,„) ! + 5/4^4 (3.46) 


(2)(^-i) 

ion in memory requirements afforded from the 


Thus, the overall reduct 
r un^ I .• J -^u^wiiculs aiiomeu irora tlie use 

of an HPT solut.on strategy can be seen by ratioing „M TS wilh A , namdy 

hMrs 


h^TS/h = 


fa 


[fp 2 ( n l) 2 + fp\L-2 )(n^ + 

3p 2 (ni) 3 


_65 [ 91(1-2) 5 

12m 127i] + 3(2 )< l 


+ ~ ^ 1 + — - 7) ; A '/ = 2, Z e [2,£] 


(3.47) 


Fixing the number of levels and letting n, become large, we see that 


lim h M 


n ! — • oo 


rs//t = 


3 (2)< i " 1 ) 


; K, = 2 , / e [2,1] 


(3.48) 


Figure 3.12 graphically illustrates the reduced number of stiffness matrix 
elements that must be stored when employing an HPT for h-extended mesh 
refinements As can be seen, an actual savings of memory can not be 

leaize unti n x > 22. The fact that, the memory requirements are increased 

or smaller h -extensions is primarily attributable to the dual storage of the 
\^ie\ partition of the stiffness matrix. 

The foregoing development has shown that the use of a multilevel sub- 
structuring approach, i.e. the HPT, for solving locally refined mesh dis- 
cretizations by ^-extension has certain distinct advantages. First the in- 
ternal variables are statically condensed out of the refined element/region 
much more efficiently and, secondly, substantial reductions in the overall 
memory requirements are achieved. In addition, although less quantifiable, 
is the local error trapping’’ provided by this technique. This is due to the 
reduction of a given degree of freedoms skyline height resulting from the 
substructured decomposition of the problem. In other words, the 'direct 
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influence of a finite element approximation error in a particular stiffness 
matrix element is confined to the degrees of freedom within the reduced 
skyline. The HPT also provides a logical and efficient means of “telescop- 
ing” an ft-extended mesh refinement into a physical anomaly or singularity. 
This is done by grafting another localized HPT onto the previous one as 
shown in Figure 3 . 13 . In the section that follows, we will develop the ad- 
vantages and quantifiable trends of using the HPT in a sequential manner 
for p- and /ip-extended mesh refinements. 

3.2 SEQUENTIAL HPT FOR p- AND hp - EXTENSIONS 

In the previous section it was shown that a locally /i-extended mesh 
discretization can be solved more efficiently by using the Hierarchical Poly 
Tree solution strategy. In this section we will show that the HPT is applica- 
ble to p- and /ip-extensions as well. As before, we will restrict the discussion 
to isoparametrically square regions. Although the development will be sim- 
ilar to the one presented for the /i-extension technique, the approach will 
be quite different. This is due to 

1. The discretization within a given element is enhanced by using higher 
ordered polynomial shape functions; and 

2. The number of elements refined in this manner is independent of the 
order of the polynomial chosen to improve their accuracy. 

In this context, to maintain compatibility with the equations used in the 
previous section, the independent variables defining this type of mesh re- 
finement will be k and 77. The interpretation of these variables is as follows: 

(k ) 2 — is the total number of elements refined by 

p-extension; and 

rj — quantifies the order of the polynomial used within the 

elements themselves by way of the number of nodes along 
an edge. 

With regards to the order of the polynomial used, the prevalent litera- 
ture typically uses complete p— order polynomials for triangular elements 
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as defined by Pascal’s triangle[9]. Therefore, to be consistent, we will define 
the p— order polynomial of a 4-node quadrilateral element as the conjunc- 
tion of two complete p— order triangular elements as shown in Figure 3.14. 
Furthermore, the variables 77 and p are related by 

V = P + 1 (3.49) 

As was pointed out by Katz, Peano, and Rossow t [10], the internal variables 
arising fiom the increase in p are condensed out at the element level. Katz 
et al. also showed that the order of the polynomial can be increased by 
enfoici ng constraints on the higher order derivatives of the shape functions 
as opposed to the use of extra spatial nodes. Regardless of which method is 
used, if only C continuity is required, the number of arithmetic operations 
needed to statically condense out the internal DOF for an element refined 
in this manner is 

C p = ^{/ > 3 [( 7 7) 6 + 9 ( 7;) 5 + 30 ( t ;) 4 - 391 ( 77) 3 + 921 ( j /) 2 - 822 ? ? + 256 ] 

+ P 2 [ 3(V) A + 30(t, ) 3 - 195(7, ) 2 + 3127, - 132] 

- /°[10 (»,) 2 - 40;, + 40]} (3.50) 

It can also be shown that the number of stiffness matrix elements for the 
previously defined p— order 4-node quadrilateral element is 

0p — ^{[( 7 ?) 3 + 6(7,) 2 — 117,-f 6]p + 277)77 (3.51) 

Equations (3.50) and (3.51) were derived for the numbering scheme given in 
Figure A. 4. Moreover, after condensation, the resulting form of the refined 
element is shown in Figure 3.15. I 11 addition, the order of the complete 
polynomials, i.e. p, are usually restricted to values of 8 or less because of the 
numerical error incurred while calculating the appropriate coefficients[ll]. 
It then follows that 

7 , < 9 (3.52) 

Referring to Figure 3.16, an isoparametricallv square region comprised 
of («) 2 elements refined by p-extension is integrated into the original, global 
FE model by performing the following algorithmic s<eps: 
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1. The internal DOF generated by the p-extension refinement are con- 
densed out at the element level; 

2. The {k) 2 refined elements are subsequently assembled; 

3. The internal DOF associated with this assemblage are condensed into 
the DOF about the periphery of the refined region; 

4. The appropriate interpolation constraints that will maintain element 
to element compatibility are applied; and 

5. The refined region is assembled into the global FE model via the 
original DOF about the periphery. 

The number of arithmetic operations needed to perform steps 1) and 3) 
can be approximated as 

C hp = 1^1! + (k) 7 C p (3.53) 

^ ^ hp 

where, from (3.5) and (3.6), 

Q;,p = {2 [(k) 2 + k] 7 7 - [3(tc) 2 + 2k - 1]} p (3.54) 

Ph P = ^{[14(tc) 3 + 7(/t) 2 — §k\p (ij) 2 

- {(37(k) 3 + 13(k) 2 - 18k]/9 - [2(k) 2 4- 2k}}i] 

4- { [24( et) 3 + 5 (k) 2 — 14/c + 1 ]/o — [3(<c) 2 + 2 ac — 1]}} 

(3.55) 

Assuming k 3> »? > 3 yields 

othp ~ />(k) 2 (2?? - 3) (3.56) 

2 

0h P ~ y(K) 3 [14(i/) 2 -37 j/ + 24] (3.57) 
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Substituting (3.56) and (3.57) into (3.53) gives the approximate computa- 
tional effort entailed in performing a large scale p-extended mesh refine- 
ment, that is 


a 


hp 


p 3 4 (14 t7 2 - 377/ + 24) 2 


+ k 2 C b 


(3.58) 


8 (27/ - .3) 

Note that, due to the way it was formulated, (3.58) is also applicable to ftp- 
extensions. This is because it is written in terms of the number of elements 
and the order of the polynomial enhancement within them, i.e. (tc ) 2 and 
?/ = (P + 1) respectively. 

In the previous section it was shown that an optimal sequential solution 
of an ft-extended mesh refinement can be obtained from a HPT configured 
such that A/ = 2, / £ [2, A]. The forthcoming discussion will show that 
the same is true when addressing p- and ftp-extended mesh discretizations. 
To set the stage for the generalized L level case, we will illustrate the 
development of a simple two level HPT where K 2 = 2. Recalling (3.9) and 
(3.50), the computational effort for a two level HPT can be written as 


r 1 ( i 0) , ( A ' 2 ) 2 ( 2 /?) 2 . ( 

hp^TS - + — + ( k ) Cp 


2 jo 


2 2 a 


Rewriting (3.19) and (3.20), 


(10) 2 + 

2 W)! + 

(3.59) 


2 a 


t 

i a ~ 

Op77 1 

(3.60) 

~ 

^ 2/ \2 
-£p ( n u 

(3.61) 


where, from Figure 3.17 and appealing to Equation (3.8), 

»i = k(»/ ~ 1) + 1 

- k.(i / - 1 ) 


(3.62) 
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Substituting (3.62) into (3.60) and (3.61) gives 

lQ ~ 6 y 0 /v ( 77 — 1 ) 


(3.63) 


65 

1/? ~ — p 2 (/t) 2 (;; - l) 2 


(3.64) 


Once again referring to Figure 3.17 and assuming that (/c/2) > 3, we see 
that the operative functions for 2 a and 2 (3 are equivalent to (3.5) and (3.6), 
namely 



(3.65) 


,0 =2/ 
' 2 [ 

14 ( 

i)’+ 7 


) p(v? 

-{ 

37 G 

;) 3 + i, 

0 


D] [ 2 (l) +2 (I)])' i 

+ { 

24 (i 

\) +5 ( 


) +i] P - [s (|) + 2 ( 1 ) - 1 }} 


(3.66) 

Applying the assumption that k is large and much greater than if yields 

~ P (^) (2 V ~ 3) (3.67) 


2 0 


P_ 

2 


(§)W 


37 Tj + 24] 


(3.68) 


Utilizing (3.63), (3.64), (3.67), and (3.68); (3.59) can be recast as 


Cts = 


hpl-'TS 


4225 
192 1 
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(2i/ - 3) 


+ ( K ) 2 


(3.69) 



T' ,e resultin S speedup obtained from such a decomposition is 

C\ 

hpC T S 


Rh P / T s = 


(g!( r -)4 ll4(.,) :> -37„-!-24p t u .\2 r , \ 

I £ 1 ; (2.,-3) + (M C p j 


{m 5 /> 3 (*) 3 (>f - 1)> + + {K yc p } 

(3.70) 

Speedups obtained for isoparametrically square regions refined by a p- 
extension and decomposed into a two level HPT is shown in Figures^ 18a 

and 3.18b. Upon inspection of these graphs the following observations can 
be made 


'■ Tl,e S1> «dup afforded by an HPT increases as the problem size in- 
creases; and 

2. For a given «, speedup may actually improve for a higher order of p. 

Lastly, before expounding the generalized sequential L level HPT it can 
be shown that, for a fixed value of p, the asymptotic speedup is ’ ' ‘ 


Ji™ Rhp/TS = 4 ; Ii 2 = 2 (3.71) 


Note that (3.71) correlates with the result of (3.23). 

Moving on to the L level HPT, the recursion formula of (3.31) can be 
written m terms of k and p, namely 


77 i 

(2)</-l) 

*(>/ ~ 1) 

(2)<'-u (3.72) 
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Substituting (3.72) into (3.29) and (3.30) yields, for l £ [1,(L — 1)], 


„ *(»? - 1) 

(3.73) 

65 ,(«)»(,, -1) ! 
~ 4 9 (2) 2 0 _1 > 

(3.74) 

1 ii0) 2 4225 3 («) 3 ('?-l) 3 

2 ; Q ~ 192 9 (2) 3(,_1 ) 

(3.75) 


Finally, appealing to (3.67) and (3.68), 


lO- 

~ P - 3) 

(3.76) 

I# 


(3.77) 

1 (l?) 2 

p 3 (tc) 4 [14(t/) 2 — 37?/ -|- 24] 2 

(3.78) 

2 ia 

8 (2) 4 ( L -4 (27/ -3) 


Employing (3.50), (3.75), and (3.78); the number of arithmetic operations 
required to solve an L level HPT sequentially in terms of k and 77 is 


Cts — 


hp^-'TS 


IZZpVl + (K y Ci 

1=1 1 


,a 


4225 

192 




X-l j 


p 3 (k) 4 [14(t 7 ) 2 — 3777 + 24] 2 2 

~T — / An , r t\ /rv o \ ’ 1*7 °‘p 


8 ( 2 )2(L-i) (27/ — 3) 


4225 


192 


~ D* 


( 2 ) {L - 


2 ) 


+ 


p* (k) 4 (14(7 7 ) 2 - 377/ + 241- 


8 ( 2) 2,L_1 * 


(2-/ -3) 


+ ( x ) ’ C p 


(3.79) 
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Ration, g (3.58) with (3.79) give* the approximate speedup that can be 
expected from employing an L level HPT. Moreover, fixing L and ,,, the 

speedup of a localized p- or /?p-exf ended mesh refinement where k is large 
can be shown to be 

Ji™ R^p/ts = (2) 2(Z,_1) ; K, = 2, / € [2. L] (3.80) 

The results of (3.36) and (3.80) indicate that asymptotically large mesh 
discretization refinements that are isoparametrically square, regardless of 
whether its an ft-, p -, or ftp-extension; can be solved with similar computa- 
tional improvements when using multilevel HPT decompositions. 

Before the relative storage requirements of the standard and HPT so- 
lutions can be quantified for p- and/or ftp-extensions, some subtle issues 
pertaining to this type of mesh refinement, must be addressed. For exam- 
ple, if the internal DOF within the p-extended elements themselves are to 
be calculated, then each individual element stiffness matrix must be saved 
for the back substitution phase of the solution. In addition, the elements 
unaltered [K 1E ] partition must be saved separately so that the internal load 
vector can be formulated as per (3.37). Furthermore, even if the internal 
unknowns are not desired, one may still wish to save the individual element 
stiffness matrices anyway. This arises from the fact that the recalculation 
of the entire element stiffness matrix can be avoided if an increase in the 
order of p is needed as the solution progressed from one iteration to the 
next. As was shown by Katz et. al.[10], special nodal variables can be 
created so that the updated element stiffness matrix can be constructed 

by simply appending the rows and columns of the new DOF to the initial 
matrix. 

Keeping the aforementioned issues in mind, the number of stiffness ma- 
trix elements stored when not using the HPT can be written as 

Alhp = fihp + k~(/3 p + p fiis) (3.81) 
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where 


/3 p = Equation (3.51) 

= Equation (3.55) 

P0IB = |[(6 t 7 3 - 11^7 2 + 5 t ? -+- 2)^ + 4(7 7 - 1 )] (3.82) 


Note that pfliE accounts for the dual storage of the connectivity partition 
of the stiffness matrix for the p-extended elements. If the internal DOF of 
the individual elements are not calculated, this term can be discarded from 
(3.81). Moreover, if the element stiffness matrix is regenerated from scratch 
when needed. /3 P can be thrown out as well. 

For the HPT, the functions given by (3.42) and (3.43) for an /i-extension 
are applicable to p- and Ap-ex tensions as well. Writing them in terms of k 
and ?; yields 


fi c i 

Mi ~ ^pV( 7 ? -l ) 2 


( 3 . 83 ) 


M, ~ ;' 6 [ 2 .(£- l )] ( 3 . 84 ) 

The number of stiffness matrix elements stored for the L— level can be 
represented as 

k 2 

Ml = (l/3 +l Pie) + +p0ie) ( 3 . 85 ) 

Assuming that t lie L— level substructure is comprised of at least nine 
p-extended elements, i.e. 


( 2 ) (i - 


i) 


3 


( 3 . 86 ) 
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Using (3.81) and (3.88), the relative reduction in the number of stiffness 
matrix elements that must be stored by the HPT, for a large number of 
p-extended elements, is 


bm hp M TS/hp 


1 (22t; 2 — 57ry -f 36) 

(2)< L- U (147/ 2 — 37t7 + 24) ’ R ' ~ 2 ’ / G f 2 ’ 1 ] 

(3.89) 


Setting the HPT solution strategy aside for a moment, it is worthwhile 
to note the significant difference in the number of computational operations 
required to condense out the internal variables of an element refined by h- 
extension as opposed to p-extension. That is, assuming (lie same number 
of DOF per node and ii] = 77, the number of arithmetic operations required 
to statically condense out the internal DOF of an h- extended element is. 
from (3.3), on the order of 0[p 3 ( t),) 4 ] and. from (3.42). on the order of 
^[p 3 ! 7 ?) 0 ] for the p-extendecl version. 
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This observation is significant in t hat the p-extension, for a fixed number 
of DOF. is generally accepted to be the more accurate of the two meth- 
ods^ 1,12]. However, as was just shown, this occurs at the cost of being 
much more computationally intensive. In the next section we will discuss 
more rigorously the implications of the relative computational efforts asso- 
ciated with ft-, p-, and ftp-extended elements. 

3.3 COMPARISON OF ft-, p-, AND ftp- EXTENSIONS 

To date, a great deal of effort has been expended to determine the rela- 
tive accuracy of the ft-, p-, and ftp-extension techniques[3, 11-13]. The crite- 
ria generally used to make this comparison is the amount of error incurred 
for a given number of DOF. But, with the same number of DOF, the static 
condensation process for the various methods of localized mesh refinement 
can have substantially different computational costs. These differences will 
be quantified on a relative basis in this section. Obviously, there are other 
aspects of the solution, beyond the actual condensation process, that can 
affect the overall computational effort. However, our objective here is to 
simply point out that the actual CPU time required to obtain a given degree 
of accuracy might be a more relevent basis of comparison. 

To begin the discussion, we will compare the methods of ft- and p- 
extensions as applied to a single element. Before proceeding, however, 
certain aspects of the comparison must be clarified. For example, when 
referring to a p-extended element, it should be understood that we are 
addressing the conjunction of two complete p~ order triangular elements 
as described in the previous section. In addition, since it is not clear as 
to whether the internal DOF of a p-extended element will be calculated or 
not, we will restrict the discussion to the computational effort associated 
with condensing the stiffness matrices and load vectors. In other words, the 
back substitution phase of the solution will not be considered. With this in 
mind, the number of arithmetic operations required to statically condense 
out the internal variables of an ft-exf ended element is. from (3.3), 

Ch - ^ 3 (”i) 4 (3-90) 
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For a p-extended element, the number of computations required to per- 
form the same operation is, from (3.50), 


: 7 f(v + 9) 


(3.91) 


Since we are addressing the refinement of a single element, n, and are 
equivalent, see Figures 3.2 and 3.14. That is, assuming the same number 

° DOF per node (p), it, and t/ define the same total number of DOF 
Therefore, from (3.90) and (3.91), 


108 

ni(ni + 9) 


(3.92) 


As can be seen from (3.92), the amount of computational effort for a single 
element refined by ^-extension exceeds that of an h -extended one when 

7,1 = 7? - To further conve y the difference in the cost of the two approaches 
a plot of the actual and theoretical ratio of C h /C p as a function of n, is 
shown m Figure 3.19. This difference can be accounted for by the realization 
that a p-extended element gives rise to a stiffness matrix that is very nearly 
full while an h-ex tended one is essentially banded. The significance of this 
observation can be better appreciated when you consider that, from (2.4), 


1 w 

2 07, 

w 

2 Q„ 


(3.93) 


(3.94) 


c*h = ol p = p(n 1 ) 2 


(3.95) 


0h = Equation (3.1) 


It then follows that 


= Equation (3.51) 


(3.96) 
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Thus, any relative reduction in the number of stiffness matrix elements 
afforded by an ^-extension is amplified by the nonlinearity of (3.96). As an 
example, from Figure 3.20, 


^ ~ \ ; U X =7, = 15 (3.97) 

Hp ** 

Substituting (3.97) into (3.96) gives 

0 1 

7 T ~ 7 ; n-i = V = 15 (3.98) 

6 n 4 


Note that the result of (3.98) correlates with the value shown in Figure 3.19 
for the prescribed level of refinement. 

Proceeding onto a comparison of the ft- and ftp-extension techniques, 
the number of arithmetic operations needed to condense an ftp-extended 
element/region can be approximated as, from (3.58) and (3.91), 


C 


hp 


p 3 4 (147 7 2 - 37?/ + 24) 2 
8 * (2?; - 3) 


+ ^«V(»7 + 


(3.99) 


Recalling that n* can be written in terms of k and ?/ by (3.62), C\ can be 
recast as 

C h ~ - 1) + l] 4 (3.100) 

Using (3.99) and (3.100), Figure 3.21 shows the resulting ratio of C\/Ch p 
for various values of k and i]. As can be seen, Ch is greater than Chp 
for all the combinations of k and i] presented. Since it was shown earlier 
that C p is greater than C*,, this result may seem to be contradictory. The 
explanation for the apparent discrepency is as follows. When performing 
an ftp-extension, the individual p-extended elements condense out their 
internal DOF before they are assembled into the composite version of the 
refined mesh. Recall that this sequence of algorithmic steps is illustrated 
in Figure 3.16. As a result, an ftp-extension constructed in such a manner 
is inherently a two level HPT wherein the 2— level substructures are the 
p-extended elements. Then, just like the HPT, if the size of the 2— level 
substructures (p-extended elements) becomes too large with respect to their 
subsequent assemblage defined by AA (k)? the computational ellort will 
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beccne su hoplnnal. Thus, if the order of the polynomials wiihin the p. 

es' In C T , • 77 °"7 ^ "■*«* <«**. ft would he 

Fil e i 21**’ T 1 , r, ,0r * f T ““ small ' r of -C can he seen from 

.gure 3.21. Toohtani this relation, however, would require orders of p 
that are impractical and typically not used. F 

To get a better feel for the computational efficiency obtained from per- 
oinung an ftp- ex ten si on m the aforementioned manner, we will decompose 
the ^-extension into the same number of substructures as p-extended ele- 
men s, i.e. A 2 - k and n 2 = V . Consequently, the only difference between 
he two approaches is in the effort of condensing out the internal DOF of 
the substructures on the 2^ level. Thus, in terms of « and 


X' 


TS 


P 3 4 ( 14?/ 2 - 37 ?/ - j - 24) 2 9 

^ K — _ 


( 2 n - 3) 

Ratioing (3.101) with (3.99) gives 

iX'ts 


+ ^ V (v-l) 4 


C 


h P 


< i ;«•>»?> 3 


( 3 . 101 ) 


( 3 . 102 ) 


More quantitatively, Figure 3.22 shows h C TS /C hp over the same range of 
values or k and , 7 used in Figure 3.21. Comparing these two plots with 
each other, we see that the use of localized condensation methods yields a 
more favorable comparison of computational efficiency for an h-extension 
Moreover, as the number of substructures («) increases, the ratio h C T s/C, 
approaches unity. This occurs because the assemblage of the substructures 
is dominating the solution. Since the composite version of the mesh refine- 

ZTjlOl] SamC for both methods > if stands to reason that, from (3.99) 

,• iXts 

-“ftT " 1 0.103) 

Based upon the foregoing, it, is apparent that, the use of substructurine 
techniques can significantly impact the relative computational costs In 
his context, we will compare the h- and ftp-extension techniques when they 
have been hierarchically substructured into their respective optimal HPT‘s. 
In terms of the parameters K and ?/. the minimum number of arithmetic 
operations for an /,-extension is, from (3.34). 


;C 


TS = 


4225 

192 


PV(,7— l) 3 


0 _ 


1 


(2)<£-2) 


+ 9 A*(»7 ~ l) 4 
2 
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( 3 . 104 ) 


where 


|{ln[/=>(/t ?7 - k : + l) 2 ] - 3} - 1 < hL opi , < \ (ln[/7(/c?7 - k + l) 2 ] - 3} 

( 3 . 105 ) 

The computational effort for an /jp-extension disseminated into its optimal 
HPT is, from (3.79), 


hpC TS 


4225 

192 


p 3 («) 3 (,,-l) 3 


2 - 


+ 


P 3 (*) 4 [14 (vl 

8 ( 2 ) 2(L_1 ) 


( 2 )< L - 2 ) 
37?7 + 24 ) 


(2 J ? - 3) 


+ ( K ) 2 Cp 


( 3 . 106 ) 


Since t he functionality of (3.106) was derived with the assumption that the 
L— level substructures are comprised of at least nine p-extended elements, 
the maximum number of levels that can be employed is obtained from the 

constraint 

2 

> 3 2 ( 3 . 107 ) 

Solving for L yields 


K 

(2)t L_1 * 


ln(«/3) T ln(2/c/3) 
ln(2) ~ hp ' opt ~ ln(2) 


( 3 . 108 ) 


Referring to the plot of hCrs/ hpCrs, he. Figure 3.23, it is very interest- 
ing to note the strong influence that an additional level of substructuring 
can have on the efficiency of an /ip-extension. This influence is manifested 
through the racheting behavior clearly seen at the appropriate values of k, 
i.e. k = 6,12, and 24. The effect of additional levels does diminish, how- 
ever, as k gets larger. This is evidenced by the progressively smaller “step** 
sizes at the transition values of k. Furthermore, the relative efficiency of 
an /i-ext ension improves as 77 increases. This is due, in part, to the fact 
that an /?-ext ension can add more levels to handle the increase in the total 
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numbei of DOF arising from larger values of 7/. Conversely, the number of 
le\els foi an frp-exteiision is strictly controlled by the number of p-extended 
elements (ac). 

Lastly, regardless of the value of Figure 3.23 shows that hC T s/h P C T $ 
is approaching unity as k becomes large. This is indicative of the fact that 
the lower levels of the HPT are dominating the solution time. Since the 
total DOF have been constrained to be the same for both methods, the fr- 
aud frp-ex tensions have equivalent computational efforts for the lower levels 
of the HPT. From this we can conclude, if the total DOF is sufficiently 
large, that the multilevel HPT provides a computational efficiency which 
is invariant to the type of fundamental finite element used in the model. 

Another facet of the solution process that could significantly impact the 
total CPU time required to perform an fr-, p-, or frp-extension is the genera- 
tion and assembly of the additional elements. I 11 general, both the number 
and type of element should be considered. In many instances, however, 
onl} one element actually needs to be created. Occasions such as this arise 
v» hen the lefined mesh is comprised of elements with the same geometry 
and aspect ratios. Although our comparison of the various methods has not 
taken this part of the solution into account, it has sufficed to show that the 
computational effort not only varies from one technique to the other, but 
is strongly dependent upon the solution strategy as well. I 11 this context, 
we have satisfied our objective. That is, the actual CPU time required to 
obtain a given degree of accuracy should at least be included in ail} 7 real 
comparison of the various mesh refinement techniques. 


4 PARALLEL HPT FORMULATION 

In the previous chapter it was assumed that, the multilevel substructural 
decomposition of the locally refined mesh discretization had to be con- 
densed/solved one substructure at a time. Since multiprocessor computers 
are becoming more commonplace, the forthcoming development will be 
based upon the premise that the substructures occurring on any particu- 
lar level can be condensed/solved concurrently. As was shown by Padovan 
and Cute [6], this approach to the solution of FE type numerical mod- 
els can yield significant computational improvements. In many instances 
the speedup will be even greater than the number of processors used, i.e. 
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superlinear. 

Superlinearit y is a measure of the processor usage efficiency. More 
specifically, superlinearity is the ratio of the effective speedup with the 
number of processors used, i.e. 


where 


_ -R g /rp 


(4.1) 


S g — superlinearity 
R 9 /tp “ effective speedup 

$ — number of processors used. 


It is the opinion of the authors that this approach to measuring the ef- 
ficiency of processor usage is more appropriate than other conventional 
measures. This arises from the fact that, as will be seen, the number of 
substructures/processors required to obtain the optimal effective speedup 
is problem dependent. From this it follow-s that arbitrarily disseminating 
an FE model into the same number of substructures as there are available 
processors will typically lead to suboptimal results. In addition, the effec- 
tive speedup will be determined by comparing the effective computational 
effort of the parallel HPT solution with that of the standard sequential so- 
lution. This is done for tw r o reasons. First, it is a measure of speedup that 
the general FE user community can identify with as a result of their almost 
exclusive use of single processor, sequential type computers. Second, using 
the standard sequential solution as a reference forms the basis from which 
the efficiency of all parallel solution algorithms can be compared. 

Overall, there are many factors that will affect the actual speedups 
obtained on a multiprocessor computer. These include: 

1. The communication/data bus structure of the processor network; 

2. The degree of sophistication of the resident compiler; 

3. The amount of globally shared memory and the concomitant access 
efficiency; 

4. The speed of the individual processors themselves, etc. 
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In this context, the objective here will be to simply illustrate the potential 
advantages of using the HPT solution strategy to integrate a local mesh 
i efinement into the initial FE model when a parallel network of processors 
can be exploited. This will include trends associated with the following: 

1. Elfective speedup; 

2. Approximation of superlinearity; and 

3. Reduction of memory requirements. 

Moreover, we will show how the superlinearity of the solution can be im- 
proved, without drastically degrading the potential speedup, by implement- 
ing a technique called Top-Down, Partial Sequentialism (TDPS)[6]. Fur- 
thermore, based on our earlier comments, it will be assumed that: 

I* *1 he time requned to tiansfei data from one level of the hierarchy to 
the next is negligible; 

2. All of the piocessors share the same computational capacity as the 
sequential reference; and 

3. Each piocessor has enough local in-core memory to store the data of 
its assigned substructure. 


4.1 PARALLEL HPT FOR /i-EXTENSIONS 

In Chapter 2 it was shown that the effective computational effort asso- 
ciated with the parallel solution of a hierarchically substructured FE model 
can be approximated as 



or, more succinctly, 

L 

C T P ^ ^ ( C J ) moa . (4.3) 

/ = ! 
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where 


(Ci) 


moi 


1 (;W 



(4.4) 


Since the substructures occurring on any particular level are constructed so 
that they are computationally equivalent, we will, for convenience, dispense 
with the max subscript. Recall that this type of substructure construction 
is possible because we have restricted the discussion to isoparametrically 
square regions of mesh refinement. 

For a two level HPT, wherein the substructures on the level can be 
condensed/solved simultaneously, (4.2) can be written as 


C'tp 


1 (i n 1(20? 

2 id 2 2 o 


(4.5) 


Addressing the solution of a local h -extension, the appropriate functionali- 
ties for / q and ,/?, l € [1,2], are defined by (3.11) - (3.14). Note that (3.11) 
and (3.12) are only valid for I< 7 > 3. Substituting these functions into 
(4.5) yields 


hC'rp 


49 3 (2A' 2 + i) 2 

ii' “dT+TT n ‘ 


, 9 3 <"i r 

~r ~ 0 

r (K 2 y 


(4.6) 


Thus, the approximate number of processors/substructures that should be 
used on the 2— level is determined by satisfying 


djhCrp) 
d(K 2 ) 


(4.7) 


Solving (4.7) yields 


/72 \ 1/5 

I<2 ~ \49 n V i n i^ 165 (4.8) 

K 2 = 3 ;n,<165 (4.9) 

Under most circumstances the value of ??, can be expected to be less 
than 165. Consequently, for our present purposes, we will assume that 
A 2 = 3. Substituting this into (4.6) gives 


,C 


2401 


O/ \3 , 3/ ,4 

TP ^ “jTJ-p ("1 ) + (»1 ) 


(4.10) 
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Ratioing C;,, as given bj (3.3). with (4.10) gives t lie effective speedup of 
this decomposition. More specifically. 


Rh/TP 


Ch 

h C'tp 


[^sfp 3 ^) 3 + iiP 3 ("i) 4 ] 

2592 77 ! 

(21609 + 32 iij) 


To set the stage for the measure of superlinearity, we can write the 
number of processors of an L level HPT, in general, as 


<f> = 



For L = 2 and I\ 2 = 3, (4.12) gives 


(4.12) 


$ = 1 + (A ' 2 ) 2 


= 10 ( 4 - 13 ) 

Ratioing (4.11) with (4.13) gives the approximate superlinearity that can 
be expected, i.e. 


5, 


Rh/TP 

$ 


^ 2592n * (4.14) 

10(21609 + 32?ii ) 

As was stated in the introduction of this chapter, the actual perfor- 
mance of an HPT on a given multiprocessor computer is dependent upon 
several factors. However, by using the empirical data obtained for the se- 
quential HPT, we can predict the actual speedup and super linearil y of a 
parallel IIPT within a reasonable percentage of error. Thus. Figures 4.1 
through 4.4 graphically show the correlation of the theoretical speedups and 
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superlinearities with (lie “actual*' for various problem sizes when L = 2 and 

K 7 = 3. 

For the case of K 2 — 2, the operative functions of ja and are defined 
by (3.19) and (3.20) respectively. Incorporating these into (4.5) yields 

4225 Q 

hC "~T92 ' 3(ni)S + 32' ?3(,,l)4 (4 - 15) 


Once again, forming the appropriate ratio with (3.3), i.e. C f /,, and (4.15) 
gives the speedup for a two level HPT with K 2 = 2. Namely, 


^ 864n 1 

h/TP (4225 + 54? ij) 


(4.16) 


From (4.12), the number of processors required to obtain the speedup de- 
fined by (4.16) is 

$ = 5 ; L = 2, K 2 — 2 (4.17) 

It then follows that the superlinearity of this particular decomposition is 

_ 864?}! 

h ~ 5(4225 + 54n ! ) (4.18) 

Figures 4.5 - 4.8 show how the speedup and superlinearity vary with prob- 
lem size for L = 2 and K 2 = 2. Moreover, from Tables (4.1) and (4.2), the 
following observations can be made: 

1. The use of four processors (K 2 — 2) on the second level provides faster 
speedups than nine (K 2 = 3) for problems where ?? 3 < 55; 


2. The magnitude of superlinearity is greater when using five processors 
instead of ten for problem sizes in the range of interest; and 

3. Regardless of whether four or nine processors are used on the 2— 
level, both the speedup and superlinearity improve as the problem 
size increases. 


Furthermore, from Table (4.3). a three level HPT with I\ 2 = K ^ = 2 
provides a better effective speedup than a two level tree lor modest values 
of In this context we can conclude, as with the sequential HPT. that 
the optimal effective speedup for a locally /?-extended element /region can 
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be obtained with a parallel HPT configuration wherein K, — 2, / G [2 L] 
i.e. ’ ’ J ’ 

R 6048( 2) 4 ( L ~ 1 >n 1 

h/TP " {4225(2)«i-i>[ 8 - (2p^ + 6048,,,} (4 * 19) 

This conclusion amends the result of A*, = 3, / € [2,1], that was given 
m (6J where the special case of K, = 2 was not investigated. Fixing the 
number of levels and allowing m to become large, the asymptotic speedup 
can readily be seen to be 


Rh/TP = (2) 4( ^- 1 » ; K, = 2, l € [2, L] (4.20) 


Moreover, the number of levels that should be employed to attain the op- 
timal effectn e speedup can be approximated by satisfying 

> 160 (4.21) 

As with Equation (3.24), (4.21) was determined from empirical data and is 
likely to vary from one machine to the next. Using the recursion formula 
of (3.31), (4.21) can be recast in terms of i?j, namely 


P(”i ) 2 

(2) 2 U -2 ) 


> 160 


Solving for L yields 


L< ^{ln[p(n, ) 2 ] - 2.3} 


(4.22) 


(4.23) 


Although (4.23) will give the number of levels that will yield the optimal 
effective speedup, the use of this many levels can severely degrade the 
super linearity of the network. This phenomena is clearly indicated in Table 
(4.3). To maintain the speedups afforded by the addition of more levels, 
while improving the superlinearity, we can employ the technique of Top 
Down, Partial Sequentialism[6]. TDPS, as the name implies, performs 
the condensation of the higher levels (Top) of the HPT in a sequential 
manner while solving the lower levels in parallel. The fact that TDPS will 
not significantly degrade the effective speedup can be seen by forming the 
ratios 

0//> = jr \ I £ [1,11 (4.24) 

C i, 
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Let ting Ii'i = 2 , / € [2,L], where L > 2, Equation (4.24) can be written as 


Q//i 

(-'l/u 


4225 


1 


864 (2) 3(,_1 • U] 

1 


;/ € [1,(L - 1)) 


( 2 )4(L-i) 

From (4.25) we see that 

C 1/h = 8C' 2/h = 64C' 3/ h = • • • = (2) 3 < l - 2 >C' (L _ 1)/;i 
I n addition, C\L~i)/h is greater than C'l/h so long as 

4225, , 

771 < W (2) 


(4.25) 

(4.26) 


(4.27) 


(4.28) 


Equ at ion (4.27) clearly shows t hat the computational effort of the higher 
levels is much less than the lower ones. It then follows that solving the 
higher levels sequentially will not impinge upon the overall effective speedup. 
Consequently, the superlinearity of the solution can be improved because 
substantially fewer processors are used to obtain essentially the same speedup 
This is evidenced by Table (4.4) where the 2— level processors were also 
used to solve the 3— level substructures sequentially. In general, the com- 
putational effort, associated with the use of the TDPS technique can be 
written as 


192 (2) 3(,_1 t 
4225 (2) 2 6 -i+ C 3 


[ 192 (2) 3 C-i) 


[4225 p 3 ^) 3 

ht-TDPS ~ 2 ^ 

L - 1 

+ £ 

l=L-C 

, 9 (2) 2£ 3/ ,4 

+ 2 ( 2 )ML-1) P ( x ' 

4225 

*192" 

9 

2 




P 3 M 3 [\[S - (2)- 3(L - £ - 2, l + (2) ,4+3£ - u, [(2) £ - 1 ]] 

(4.29) 


+ ^ 3 (h 1 ) 4 (2) ,4+2£_4L) 
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where 


C - the number of higher levels solved sequentially, 0 < C < (L - 1) 

L - the total number of levels, including (.hosed solved sequentially 

The total number of processors necessitated by the use of TDPS can be 
written in terms of L and C also, namely 

L-c 

$ = i + 

1=2 

= ^[(2)"^ c '-l| (4.30) 

The i eduction in the total number of stiffness matrix elements that 
must be stored for the parallel HPT is the same as that given by (3.47) and 
(3.48). However, for networks that do not have globally shared memory 
capabilities, it is worthwhile to note the reduced memory requirements on 
a per processor basis. Utilizing (3.38) and (3.39) with the proper function- 
alities foi id and i0je , the fractional memory needs of the processors on 
different levels are given by 

M,/h = ^h (4.31) 

Figure 4.9 depicts (4.31) for l 6 [1,(L — 1)] in terms of the parameter ii 1 . 
As can be seen, the use of the HPT solution strategy on a parallel network 
of processors for h-extended mesh discretizations can significantly reduce 
the memory demands placed on a given processor. This is especially true 
for processors employed on the higher levels of the Tree. 

If the TDPS technique is used, care must be taken to ensure that the 
available memory resources of the processor performing the computations 
of the higher levels does not become saturated. In terms of the variables 
L, C, and n i ; the number of stiffness matrix elements that must be stored 
by an (£. — £)— level processor when using TDPS is 

tdpsM(L-c) = M l _c + (4)M a _ c+1] + •.. + (2)- c M l 

- p 3 (n 1 ) 2 [91£(2) 2<£ - z -> -f 5n )l (2) ( - +2r - 3L) ] 

(4.32) 
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It lias been shown t hat the dissemination of an /i-extended element/region 
into a multilevel hierarchy of substructures can provide substantial, even 
superlinear, speed enhancements. To improve the solution characteristics 
even further, more standard parallel solution schemes could be implemented 
in conjunction with the HPT. For example, since the processors on levels 
that are not currently “active” are essentially “idle”, they could be used 
to perforin the condensation of the “active” substructures via the “Parallel 
Active Equation Solver” developed by Farhat and Wilson [14]. Moreover, 
the assemblages of the lower levels of the hierarchy would lend themselves 
very well to such solution techniques because of their relatively large skyline 
heights. 

4.2 PARALLEL HPT FOR p- AND hp-EXTENSIONS 


The advantages of using an HPT solution strategy on a parallel network 
of processors were presented in the previous section for h-extencled mesh 
refinements. In this section we will reformulate the parallel HPT in terms 
of the parameters used to describe p- and /ip-extensions, that is k and 
Once again, we will be concerned with demonstrating the effective speedup, 
superlinearity, and memory requirements provided by the HPT approach. 

To begin the discussion, recall that the approximate computational ef- 
fort of solving this type of mesh refinement without a HPT was given by 
(3.58). Now, assuming that the 2— level substructures are solved concur- 
rently, the effective number of arithmetic operations incurred by a two level 
HPT can be approximated as 


hpCrp 


i(i PY , 1 (»/?)» 

2 ja 2 2 a 



(4.33) 


Note that (4.33) also employs the assumption that the internal DOF of 
the individual p-extended elements are condensed out simultaneously by 
the processors they were assigned to on the 2— level. Letting A 2 = 2, 
Equations (3.75) and (3.78) can be used to rewrite (4.33) as 


hpCpp 


4225 

792 




P 3 k 4 ( 14»/ 2 - 37>; + 24 ) 2 
728” (2;/ - 3) 



( 4 . 34 ) 
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Ratioing (3.58) with (4.34) gives the effective speedup obtained for a par- 
allel, two level HPT where k » ,, > 3 and K 2 = 2 . Figure 4.10 shows the 
potential speedup as a function of k and 77 for this type of HPT decom- 
position. It can also be shown that as the number of p-extended elements 
becomes large that the speedup is bounded by 

Jnn Rhp/TP = 16 ; K 2 = 2 (4.35) 

This result is consistent with the two level asymptotic speedup given by 
(4.20) for large /?-extended elements/regions. 

The number of processors used for a two level Tree when K 2 = 2 is five, 
see (4.17). The superlinearity of the HPT for this set of circumstances is 
shown in Figure 4.11. As can be seen from Figures 4.10 and 4 . 11 , both 
speedup and super linearity improve as the number of p-extended elements 
is increased. 

For a general L level HPT wherein Ii, = 2, / € [ 2 ,£], the approximate 
computational effort as a function of k and 7 ; is 

4225 

h P C T p ~ j^p 3 /c 3 (7/ - 1) 3 [8 - (2)~ 3(i-2) ] 

pV (147, 2 - 3777 + 24) z 
8(2)4(^-i) (2?/ - 3) 

K 2 

^ p (4.36) 

The speedup potential for a given number of levels, L, from (3.58) and 
(4.36), is 

hm R hp/TP = (2) 4(L_1 ) ; K, = 2, / e [2, L) ( 4 . 37 ) 

The bounded speedups given by (3.36), (3.80), (4.20), and (4.37) are 
indicative of the fact that the substructure assemblages occurring on lev- 
els less than the L— have relative computational efforts that are inversely 
proportional to the problem size. In other words, for h -extensions, 

7r = C>( — ) ; / e (1.(1- - 1)] (4.38) 

Lh \n } / 

or, for p - and hp-ex tensions, 

7 C = 0 \znrrr] ; *€ |i.<i - (4.39) 
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\ 


In this context, the speeclups obtained from a multilevel HPT is only limited 
by the number of levels which can be used to scale down the FE model from 
its global form to that of a much smaller L— level substructure. 

From the perspective of superlinearity, the number of processors for an 
L level HPT with I<i = 2, / € [2 ,L], can be written as 


$ 


|[( 2 ) 2£ ' - 1 ] 

(2) 2L 

3 


( 4 . 40 ) 


Using (4.37) and (4.40), the superlinearity for an asymptotically large mesh 
refinement with a fixed value of L is 

.. c lim«-^oc Rhp/TP 


— (2) 2L ( 4 . 41 ) 

16 V ’ 


For more typical values of k, the most efficient use of the processor 
network would be obtained by using the technique of TDPS. In general, 


hpC-TDPS 


4225 


192 


-pV( 7 / - l) 3 { * [8 - (2)- 3 < l - c " 2) ] + (2) (4+2C - 3L) [(2) £ - 1]} 


+ ( 2 ) 


2 C 


p 3 * 4 (147, 2 - 37I/ + 24) 2 

8 (2)4(£-i) (2t, — 3) 


+ 


C 

(2)2(L-d^p 


( 4 . 42 ) 

Figures 4.12 and 4.13 show the speedup and superlinearity for various val- 
ues of k and 7 , when 1 = 3 and C = 1. Comparing these with Figures 
4.10 and 4.11, one can see that the technique of TDPS not only improves 
superlinearity, but, for larger numbers of p-extended elements, can enhance 
the overall effective speedup as well. 

The total number of stiffness matrix elements stored by the HPT for 
this type of mesh refinement is given by (3.88) and (3.89). On a per proces- 
sor/substructure basis, the relative storage requirements can Ire posed in 
the same manner as (4.31). The operative functions for p- and /7p-extensions 
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are given by (3.81) and (3.83) - (3.85). More specifically, 

(4,43) 


(4.44) 

(4.45) 

Comparing (4.43) with (4.44) we see that the storage requirements on the 
first level exceeds those of the processors on the 2 — through (L — 1 )~ levels, 
that is 

Ml/hp > Rfljhp > > Ml/hp > *** > ^/(L-l)//ip (4.46) 

In this context, the first level forms an upper bound of the memory re- 
quirements for the processors used on the first (L — 1 ) levels. This upper 
bound is shown in terms of k and 77 in Figure 4.14. As can be seen from 
Figure 4.14, the relative reduction in the number of stiffness matrix ele- 
ments that must be stored on a per processor basis improves as the number 
of p-extended elements increases. Upon inspection of (4.45) it is apparent 
that the relative storage requirements for the L— level processors depends 
on L. k. , and 77 . However, since the relative storage requirements for the L— 
level decreases as L increases, the limiting fractional storage requirements 
for this level occurs when L = 2 . Thus, from (4.45), it can be shown that 

M L/hp <^ ;?/< 10 ( 4 . 47 ) 

To conclude our discussion on the use of the HPT for p- and frp-extended 
mesh discretizations in a parallel computing environment, the total number 
of stiffness matrix elements that must be stored by the (L — £)— level 
processor when using TDPS is 

TDPS^f{L-C) — {2 ) { ^ l ^ C) M } + {2Y c M l 

t=L-C 

- 91pV(7/ - l) 2 C(2) 2{C ~ L) + (2 ) 2C M l (4.48) 


!A 2 (y-i ) 2 

[fihp + k 2 (/ 3 p + p i3je)} 




I^hp + ^p+pM] ’ 

[(L0 +L 01 e) + (2 )2U-i| (fip +p 0IE)} 

\0h P + K? {0p +p 0Ie)] 


45 



Equation (4.48) can be used to ensure that the processor performing the 
calculations of the higher levels will not have its local memory resources 
saturated when employing TDPS. Note that this applies only for machines 
that do not have “globally” shared memory facilities. Furthermore, (4.48) 
is similar to (4.32) in that the only differnce is in the amount of storage 
necessitated by the L— level substructures. This arises from the fact that 
(4.48) can account for the storage of the stiffness matrices associated with 
the p-extended elements and their subsequent assemblage into the L— level 
substructure. 


5 SUMMARY 


This paper has demonstrated how a multilevel substructuring technique, 
called the Hierarchical Poly Tree (HPT), can be used to integrate a local 
mesh refinement into the original finite element model more efficiently. The 
optimal HPT configurations for solving isoparametrically square regions of 
mesh refinement on single and multiple processor computers was derived. 
Moreover, it was also shown that the HPT inherently reduces the total 
number of stiffness matrix elements that must be stored. For example, 
an /i.-ex tension of an element/region can be solved sequentially on a single 
processor computer with a speedup approximated by 

_ C ±_ 432 (2)^-1) ni 

h/TS hC T s ~ {4225 (2) (L “ 1 ) [(2)( L-1 > — 1] 4- 432 i. } 

(5.1) 


As can be seen, the speedup afforded by the HPT is dependent upon the 
size of the mesh refinement and the number of substructuring levels used, 
i.e. 7 i i and L respectively. However, for a given value of L, the asymptotic 
speedup for large ni is 

lim Rh/rs = (2) 2(i_1) (5.2) 

ti j — 'oo 

In addition, the fractional number of stiffness matrix elements that must 
be stored when using the HPT solution strategy was shown to be 


M- 


Mrs (911-117) 


TS/h = 


M h 


127?j 


+ 


3(2 1' 1 - 11 


(5.3) 
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The number of levels that should be employed to achieve the optimal 
speedup must be determined empirically. That is, the smallest substruc- 
ture size that the refinement can be subdivided into, without degrading 
the solution time, is machine dependent. This is because computational 
overhead varies from one machine to the next, and is the dominant factor 
in determining this parameter. Theoretically, the number of levels that 
should be used to obtain the most efficient solution would only be limited 
by the fact that the L— level substructures must contain more than one 
fundamental finite element[6]. 

To address the solution of an ftp-extended element /region, the degree 
of mesh refinement was defined by the variables k and 7 /. More specif- 
ically, (k) 1 2 is the total number of p-extended elements and r/ quantifies 
the complete p— order polynomial used within the elements by way of the 
number of “external” nodes along one edge of the periphery. Note that 
an ftp-extension of a single element is computationally equivalent to the 
assemblage of (k) 2 p-extended “global” elements of the same polynomial 
order. The speedup obtained by using the sequential HPT for this type of 
mesh refinement is given by 

n C'hp Equation (3.58) 

rr/ip/rs = y , — ~ 77 p 7 (5.4) 

hpL'TS Equation (3. *9) 

However, if k is large, the relation 

ni ~ k(i] - 1) (5.5) 

can be used in conjunction with (5.1) to approximate the resulting speedup 
within a reasonable percentage of error. 

As with the ft-extension, the HPT solution strategy can reduce the total 
number of stiffness matrix elements that must be stored for an ftp-extension. 
The actual magnitude of these savings, however, is dependent, upon some 
subtle, but significant issues. These include: 

1. If the internal DOF within the p-extended elements themselves are 

to be calculated, then each individual element stiffness matrix, along 
with its unaltered [A/e] partition, must be saved for the back substi- 
tution phase of the analysis; or. 
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2. If the special nodal variables developed by Katz et.al. [10] are used, 
the element stiffness matrix may be saved to avoid recalculating it 
from scratch whenever an increase in the order of p is needed as the 
solution progresses from one iteration to the next. 

With this in mind, the relative number of stiffness matrix elements that 
must be stored when utilizing the HPT with respect to the standard hp- 
extension solution is 


,, hpMrs Equation (3.88) 

hpMrs/hp = —T} ~ — — (5.6) 

M) ip Equation (3.81 ) 

It needs to be pointed out that Equations (3.81) and (3.88) have ac- 
counted for the extra storage required to store the element stiffness matri- 
ces and corresponding [Kie\ partitions. If these matrices do not have to be 
stored, the appropriate terms can simply be discarded. For a large number 
of p-extended elements, it was shown that 


lim 


hpMrs/hp 


1 (22?/ 2 — 57?/ + 36) 

(2)( i ' _1 > ( 14?7 2 — 37 t; -f 24) 


(5.7) 


The advantages of using the HPT solution strategy on a multiprocessor 
computer were also presented. Machines of this type provide the capability 
to condense/solve the substructures on any particular level concurrently. 
Oftentimes the effective speedup that can be obtained from exploiting this 
technology is even greater than the number of processors used. In partic- 
ular, an ^.-extension solved in this manner will yield an effective speedup 
of 

Ch 6048(2) 4(L-1 b?i 


Rh/TP = 


(5.8) 


h C TP {4225(2)d L-1 >[8 - (2)- 3 ( L - 2 )] + 60487?!} 

Recall that this measure is made relative to the standard sequential solu- 
tion. This was done for the following reasons: 


1. It is a measure of speedup that the general FE user community can 
identify with as a result of their almost exclusive use of single proces- 
sor, sequential type computers; and, 


2. Using the standard sequential solution as a reference forms the basis 

from which the efficiency of all parallel solution algorithms can be 
compared. 
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As with the sequential HPT, the potential speedup depends upon t lie prob- 
lem size, 7?j, and the number of substructuring levels, L, employed. But. 
fixing the value of L and letting become large yields 

bin Rh/TP = { 2) 4(L_1) (5.9) 

Although (5.9) indicates that the use of more levels would enhance the 
overall speedup that could be attained, one must consider if the increased 
number of processors that this would require is warranted. It was in this 
context that a measure of the processor usage efficiency, called superlinear- 
ity. was defined. Specifically, it is the ratio of the effective speedup with the 
total number of processors/substructures used. This approach to quantify- 
ing the efficiency of the solution was chosen because the optimal number of 
processors/substructures that should be used to achieve the best speedup 
is problem dependent. Since the number of processors for an optimal HPT 
is 

* = |[(2) 2£ - - 1] 


~ ~ — (5.10) 

the superlinearity for an /i -extension can be approximated as 

5 _ Rhttp 18144(2)< 2J - 4 >n 1 

h 4- ~ {4225(2) 4 t £ — 1 >[8 - (2)- 3 < L - 2 )] + 6048^} ' ’ 


Thus, from (5.11), a two level HPT will provide a speedup that is greater 
than the number of processors used when 77 2 > 40. For a three level HPT, 
the problem size must be such that iii > 130. 

To achieve the speedups afforded by the addition of more substructur- 
ing levels, and still be computationally efficient for smaller sizes of 77 a 
technique called Top-Down, Partial Sequentialism (TDPS)(6] can be used. 
TDPS takes advantage of the fact that the higher levels of 1 he hierarchy 
represent a small portion of the total computational effort. Subsequently, 
the substructures 011 the higher levels of the HPT can be solved sequentially 
by processors assigned to the lower levels without significantly impinging 
the overall solution time. As an example, for an /^-extension with 11 ^ = 37. 
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it. was shown that the third level substructures of a three level HPT solved 
sequentially by the second level processors would yield a superlinearity 
greater than one while still “conserving” 81 percent of the speedup. Both 
the superlinearity and effective speedup obtained from using TDPS improve 
as the problem size increases. 

For /ip-extensions solved by a parallel HPT, the effective speedup that 
can be expected, in terms of k and 7 /, is 


Rhp/TP — 



hpC'TP 


Equation (3.58) 
Equation (4.36) 


(5.12) 


Using (5.12), it can be shown that the limiting speedup for a given number 
of substructuring levels is 

lim Rhp/rs = (2) 4(i_1) (5.13) 

> 00 


Note that the asymptotic speedups provided by sequential and parallel 
IiPT’s are invariant with respect to the mode of refinement, used. I 11 other 
words, regardless of whether the refined mesh discretization is a large scale 
h- or /^extension, the relative speedups will be the same. In fact, it was 
shown that the actual solution times will be essentially the same for a 
given number of DOF. This follows from the observation that the T— level 
substructures, that are comprised of the basic finite element assemblages, 
represent a small portion of the overall solution time. Consequently, the 
actual CPU time required to solve a large, isoparametrically square mesh 
refinement via an HPT will not be significantly affected by the type of 
finite elements used! Simply put, it does not matter whether 3 or 6 node 
triangular, 4 or 8 node quadrilateral, etc. elements are used; the HPT 
solution strategy will yield the same relative speedup and CPU time. It 
must be reiterated, however, that this is only true for an asymptotically 
large number of DOF. 

Without the use of the HPT, the computational effort for solving 5-, 
p-, and /ip-extensions can differ substantially for the same number of DOF. 
As an example, assuming the same number of DOF per node and — j/, 
the relative number of arithmetic operations required to condense out the 
internal DOF of single /?- and p-extended elements is 


Cp 



(5.14) 
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E< r fl °, n ( f.'l 4) Cleaily S,lows ,lla< ,hcse ‘wo types of refinement have sig- 
nificantly different computational costs. Turning our attention to h- and 
up- ex tended element/regions, the relative computational effort for perform- 
ing the condensation process is, in terms of k and ij, 


t'h ^ Equation (3.100) 

C\p Equation ( 3 . 99 ) (5.15) 

Note that (5.15) was derived with the assumption that the internal DOF 
of the individual p-extended elements are condensed out before they are 
assembled. For values of 77 < 15, it was shown that 


Ch>C hp (5.16) 

Thus, the relative computational effort involved in condensing/solving the 
various modes of refinement is, for a fixed number of DOF, 


Cp > C h > C hp (5.17) 

This comparison of the various methods was performed to illustrate that 
the practice of comparing their relative accuracy on a DOF basis may be 
misleading. From a pragmatic point of view, it is our opinion that the 
actual amount of CPU time required to obtain a certain degree of accuracy 
may be a more relevant form of comparison. 

In closing, the HPT has been shown to be computationally efficient and 
less demanding of memory resources. From a more philosophical perspec- 
tive, the HPT solution strategy also provides: 


1. Localized “error-trapping”. This occurs because a given DOF, as a re- 
sult of the hierarchical substructuring process, will have a compacted 
column height.. Consequently, the reduced coupling with other DOF 
diminishes the direct influence that a finite element approximation 
error can have on the rest of the model. 


2. A means with which to “telescope" into a physical anomaly or analyt- 
ical singularity by grafting another localized HPT onto I lie previous 
one. ^ Tins is a logical, efficient way of traversing from “coarse" to 
“fine” scales of model definition. 
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FIGURE CAPTIONS 


2.1) Schematic of Multilevel Hierarchical Poly Tree (HPT) 

2.2) Schematic of Root-Branch System 

2.3) Finite Element Model of Transverse Fluid Flow Within Tubular 
Bundles 

2.4) Finite Element Model of An Arch 

3.1) Schematic of a Multilevel HPT for a Localized Mesh 
Refinement 

3.2) h-Extended 4-Node Quadrilateral Element 

3.3) Schematic of Algorithmic Steps for an h-Extended 
Element/Region 

2 

3.4) Dissemination of an h-Extended Element/Region into (K^) 
Substructures 

3.5) Assemblage of Statically Condensed Substructures 

3.6) Schematic of the Algorithmic Steps Associated with the 
First Level of the HPT 

3.7) Theoretical Speedups for Sequential Two-Level HPTs Applied 
to an h-Extension (K^ = 2,3, and 4) 

3.8) Speedup for Sequential, Two-Level HPT Applied to an 

h-Extension (p = 1, - 2, L = 2) 

3.9) Speedup for Sequential, Two-Level HPT Applied to an 

h-Extension (p = 2, = 2, L = 2) 

3.10) Speedup for Sequential, Two-Level HPT Applied to an 

h-Extension (p = 1, = 3, L = 2) 

3.11) Speedup for Sequential, Two-Level HPT Applied to an 

h-Extension (p = 2, = 3, L = 2) 

3.12) Fractional Number of Stiffness Matrix Elements That Must 
Be Stored When Using an HPT for an h-Extension 

3.13) Diagram of "Telescoping" Technique 

3.14) p-Extended 4-Node Quadrilateral Element 

3.15) Static Condensation of a p-Extended 4-Node Quadrilateral 
Element 


54 



3.16) Schematic of the Algorithmic Steps Associated with a Region 
of p-Extended Elements (hp-Extended Element/Region) 

3.17) Dissemination of a p-Extended Region of Elements 

hp Extended Element/Region) into Substructures 

3.18) Theoretical Speedups for Sequential, Two-Level HPT Applied 

to a Region of p-Extended Elements (hp-Extended Element/ 
Region ) 

3.19) Relative Computational Effort of an h- and p-Extended 
Element 

3.20) Fractional Number of Stiffness Matrix Elements for an h- 
and p-Extended Element 

3.21) Relative Computational Effort of h- and hp-Extensions 

3.22) Relative Computational Effort of "Equivalent" h- and 
hp-Extensions (K ? = k, n_ = q) 

3.23) Relative Computational Effort of h- and hp-Extensions 
Decomposed into Their Respective Optimal Sequential HPTs 

4.1) Speedup for Parallel, Two-Level HPT Applied to an h-Extension 
(p=l, K 2 =3, L=2) 

4.2) Speedup for Parallel, Two-Level HPT Applied to an h-Extension 
(p = 2, K 2 = 3, L = 2) 

4.3) Superlinearity for Parallel, Two-Level HPT Applied to an 
h-Extension (p=l, K 2 =3, L=2) 

4.4) Superlinearity for Parallel, Two-Level HPT Applied to an 
h-Extension (p = 2 , K 2 = 3 , L = 2 ) 

4.5) Speedup for Parallel, Two-Level HPT Applied to an h-Extension 
(p = 1, K 2 = 2, L = 2) 

4.6) Speedup for Parallel, Two-Level HPT Applied to an h-Extension 
(p = 2, K 2 = 2, L = 2) 

4.7) Superlinearity for Parallel, Two-Level HPT Applied to an 
h-Extension (p = 1, Y = 2, L = 2) 

4.8) Superlinearity for Parallel, Two-Level HPT Applied to an 
h-Extension (p = 2, 1< 2 = 2, L = 2) 

4.9) Fractional Number of Stiffness Matrix Elements That Must Be 
Stored When Using an HPT for an h-Extension on a Per 
Processor Basis 
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4.10) Theoretical Speedup of Parallel, Two-Level HPT Applied to 
a Region of p-Extended Elements (hp-Extended Element/ 

Region ) 

4.11) Theoretical Super linear ity of Parallel, Two-Level HPT 
Applied to a Region of p-Extended Elements (hp-Extended 
Element /Reg ion) 

4.12) Theoretical Speedup of Parallel, Three Level HPT With 

One Level of Sequent ialism Applied to a Region of p-Extended 
Elements (hp-Extended Element /Reg ion) 

4.13) Theoretical Super linear ity of Parallel, Three-Level HPT 
with One Level of Sequentialism Applied to a Region of 
p-Extended Elements (hp-Extended Element/Region) 

4.14) Fractional Number of Stiffness Matrix Elements That Must Be 
Stored for the First Level of the HPT When Applied to a 
Region of p-Extended Elements (hp-Extended Element/Region) 

A.l) Numbering Scheme Employed for h-Extended Element/Region 

A. 2) Numbering Scheme Employed for Assemblage of Substructures 

i 3> 

A. 3) Numbering Scheme Employed for Assemblage of Substructures 

«l * 2) 

A. 4) Numbering Scheme Employed for p-Extended 4-Node Quadrilateral 
Element 
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TABLE CAPTIONS 


3.1) Speedups for Two-Level, Sequential HPT Applied to an 

h-Extension with = 2 

3.2) Speedups for Two-Level, Sequential HPT Applied to an 

h-Extension with = 3 

3.3) Speedups for Three-Level, Sequential HPT Applied to an 
h-Extension with K 2 = K 3 = 2 

4.1) Effective Speedups and Superlinearities for Two-Level 

Parallel HPT Applied to an h-Extension with K =23 
and p = 1 2 ’ 

4.2) Effective Speedups and Superlinearities for Two-Level, 

Parallel HPT Applied to an h-Extension With K =23 
and p = 2 2 * 

4.3) Effective Speedups and Superlinearities for Three-Level, 

Parallel HPT Applied to an h-Extension with = 2 

4.4) Effective Speedups and Superlinearities for Three-Level 
Parallel HPT with One-Level of Sequentialism ( £ = l) 
Applied to an h-Extension (K^ = K = 2) 
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APPENDIX A 

The appendix is comprised of four figures which illustrate the 
numbering schemes employed for the substructuring primitives used 
in the development of the HPT. Note that 

1) The "internal" degrees of freedom (DOF) are 
numbered first; and, 

2) The "external" DOF lie on the periphery of each 
substructuring primitive. 
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FIGURE 2 . 1 
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FIGURE 2 . 3 
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FIGURE 3 . A 
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